[ci] Add surefire fork timeouts to prevent CI hangs #6186

Open
joewiz wants to merge 1 commit into eXist-db:develop from joewiz:bugfix/ci-surefire-timeouts
joewiz commented Mar 26, 2026

Summary

When a test like DeadlockIT or MoveResourceTest hangs during CI, the surefire/failsafe forked JVM waits indefinitely — the only protection is the GitHub Actions step timeout at 45 minutes. This burns CI minutes and blocks PR merges.

This PR adds surefire fork timeouts so hung tests are killed after 10 minutes instead of 45.

What changed

exist-parent/pom.xml

Added to both maven-surefire-plugin and maven-failsafe-plugin configuration:

<forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
<forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
  • forkedProcessTimeoutInSeconds=600: Kills the forked JVM after 10 minutes. Clean local test runs complete in ~3.5 minutes, so this should only fire on hung tests.
  • forkedProcessExitTimeoutInSeconds=60: Once tests have finished, gives the fork 60 seconds to flush results and exit before it is force-killed.
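
For orientation, here is a minimal sketch of how the two settings described above might sit in a parent POM. The enclosing `<build>`/`<pluginManagement>` structure is an assumption for illustration; the actual layout of exist-parent/pom.xml, plugin versions, and any other plugin configuration are not shown here.

```xml
<!-- Sketch only: the enclosing elements are assumed, not copied from exist-parent/pom.xml -->
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <!-- Kill the forked test JVM if it is still running after 10 minutes -->
          <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
          <!-- After tests complete, allow 60 seconds for the fork to exit before force-kill -->
          <forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-failsafe-plugin</artifactId>
        <configuration>
          <!-- Same limits for integration tests run by failsafe -->
          <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
          <forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>
```

A value of 0 for forkedProcessTimeoutInSeconds means no timeout, which is the behavior this PR moves away from.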

.github/workflows/ci-test.yml

Reduced integration test step timeout from 45 to 30 minutes. With surefire killing hung forks at 10 minutes, 30 minutes provides ample buffer.

Evidence the fix works

The Windows CI run on this PR shows the timeouts taking effect:

| Metric | Before (no timeout) | After (this PR) |
| --- | --- | --- |
| DeadlockIT | Hung for 44 min, killed by step timeout | Completed in 522s (under 600s limit) |
| Windows integration total | 49 min (step timeout) | 14 min 55s |
| Tests lost | Fork killed mid-run, results lost | All 9 tests passed, 0 failures |

The remaining BUILD FAILURE is a fork exit hang (BrokerPool/BlobStore shutdown delay) — the fork completed all tests successfully but didn't exit within the 60s forkedProcessExitTimeoutInSeconds. This is addressed by companion PR #6183 (bounded BlobStore join() timeouts).

Why these values

Our hang experiment (Round 3) showed:

  • Clean test suite runs complete in 3:26–3:51 (local) and ~16 min (CI with all modules)
  • Hung tests (DeadlockIT, MoveResourceTest) run indefinitely without a timeout
  • 600 seconds (10 min) is generous enough to never fire on healthy tests but aggressive enough to recover quickly from hangs

Test plan

🤖 Generated with Claude Code

Configure forkedProcessTimeoutInSeconds=600 and
forkedProcessExitTimeoutInSeconds=60 in both maven-surefire-plugin
and maven-failsafe-plugin in exist-parent/pom.xml. This kills forked
JVMs that hang (e.g. DeadlockIT, MoveResourceTest) after 10 minutes
instead of waiting indefinitely for the 45-minute GitHub Actions step
timeout.

Also reduce the integration test step timeout from 45 to 30 minutes
in ci-test.yml — with surefire killing hung forks at 10 minutes,
30 minutes is plenty for the full integration suite.

Clean runs complete in ~3.5 minutes; the 600s timeout is a safety
net that only fires on hung tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz requested a review from a team as a code owner March 26, 2026 05:19
@dizzzz dizzzz requested review from a team, duncdrum, line-o and reinhapa March 26, 2026 21:49